-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Frontend] Disaggregate prefill decode with zmq #11791
base: main
Are you sure you want to change the base?
[Frontend] Disaggregate prefill decode with zmq #11791
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
🚀 |
Ping when ready. NOTE for reviewers: do not merge until me and @russellb have a chance to review |
vllm/entrypoints/launcher.py
Outdated
clients.bind(url_client) | ||
logger.info(f"ZMQ Server ROUTER started at {url_client}") | ||
# Socket to talk to workers | ||
workers = context.socket(zmq.DEALER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simon-mo I am not familiar with ZMQ --- is dealer the right technical choice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://zguide.zeromq.org/docs/chapter3/#The-DEALER-to-DEALER-Combination
We need to proactively send messages to workers in this scenario.
-
ROUTER is not suitable for initiating messages because it doesn't know the identities of other receivers until it receives the first message. Only then can it establish routes for interaction.
-
REQ requires acknowledging each message before sending the next one, which doesn't meet our requirements.
-
DEALER allows us to actively send messages and supports asynchronous multi-send and multi-receive, making it the more suitable pattern. It's important to note that we need to maintain the DEALER's ID.
vllm/entrypoints/connect.py
Outdated
prefill_request['max_tokens'] = 1 | ||
route = "/v1/completions" | ||
# finish prefill | ||
async for x in execute_task_async(route, header, prefill_request, app.state.sockets_prefill): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A potential optimization (you don't need to implement it in this PR): return the first token generated by the prefill instance in this async for
, instead of reposting the request to decode instance and waiting for the first token from there.
print("Worker DEALER started at", url_worker) | ||
|
||
tasks = [asyncio.create_task(worker_routine(url_worker, context, i)) for i in range(5)] | ||
proxy_task = asyncio.to_thread(zmq.proxy, clients, workers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zmq sockets are not threadsafe. This cannot run in a background thread it must be in an asyncio task.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means you cannot use the built in proxy since it does not use async sockets. In prior versions of VLLM, you will have to write your own proxy (its like 10LOC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test scripts test_connect_server1.py and test_connect_server2.py were used to simulate model responses. I've since removed them.
vllm/entrypoints/connect.py
Outdated
yield | ||
## close zmq context | ||
logger.info("term zmqctx") | ||
await app.state.zmqctx.term() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use destroy(linger=0)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great point! To ensure immediate termination and avoid potential blocking, I'll switch to using destroy(linger=0) instead of term(). I also replace it in the vllm/entrypoints/launcher.py
https://pyzmq.readthedocs.io/en/latest/api/zmq.html#context
After interrupting all blocking calls, term shall block until the following conditions are satisfied:
- All sockets open within context have been closed.
- For each socket within context, all messages sent on the socket have either been physically transferred to a network peer, or the socket’s linger period set with the zmq.LINGER socket option has expired.
vllm/entrypoints/launcher.py
Outdated
logger.info(f"ZMQ Worker DEALER started at {url_worker}") | ||
|
||
tasks = [asyncio.create_task(worker_routine(url_worker, app, context, i)) for i in range(5)] | ||
proxy_task = asyncio.to_thread(zmq.proxy, clients, workers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zmq sockets are not threadsafe. You cannot run this in a background thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I appreciate you pointing out potential thread safety issues with zmq sockets. You are completely correct; By default, they are not thread safe. I will prioritize finding a more thread safe alternative in the future to ensure robust operation in multi-threaded environments.
As zmq.proxy() is a synchronous function, executing it directly within the main thread can potentially block the server.
Currently, these two sockets are used exclusively within this thread. While I believe there are no immediate thread safety concerns, it's prudent to consider future scalability and maintainability. Can we address potential thread-safety issues in a subsequent PR?
https://zguide.zeromq.org/docs/chapter2/#ZeroMQ-s-Built-In-Proxy-Function
It’s exactly like starting the main loop of rrbroker.
https://github.com/booksbyus/zguide/blob/master/examples/Python/rrbroker.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@robertgshaw2-neuralmagic Hi, Robert I have resolved this issue by threadproxy. It will create sockets in the proxy, which no one can access. So there won't be any thread safety issues
@panf2333 - Thanks for the PR! Disaggregated serving is a hugely important initiative for VLLM in 2025 I am responsible for the multiprocessing + asyncio + zmq architecture of VLLM, so I am going to review this in detail. I am having some trouble following the design here. Can you make a simple diagram that charts out what these objects are to ease in review? Thanks! |
@robertgshaw2-neuralmagic It's my pleasure. I'll put together a diagram and send it over shortly. |
@robertgshaw2-neuralmagic These are simple diagram , hoping to help you better understand this PR. I also updated the description of PR. The relationship with client connector and vllm serverThe zmq detail between connector and vllm server |
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
2.To more accurately reflect its purpose, we will rename connect.py to disagg_connector.py. Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
…oy(linger=0) for immediate termination Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
1bc97ec
to
0728a42
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the whole design yet, but I have one early comment: is all zmq communication local? If so, can you please use ipc://
sockets instead of tcp://
? That will avoid some security concerns.
@russellb I completely agree that security is a paramount concern. Given the Disaggregated serving feature's potential to dispatch requests to other nodes, it's crucial to establish a secure communication channel between the connector proxy, prefill node, and decode node. In order to connect the connector proxy, pre filled nodes, and decoding nodes, we should use 'tcp://' in vllm/entrypoints/disagg_connector.py in vllm/entrypoints/launcher.py
In the server side we will use "inproc://workers" to deal the message. |
This is big and complex enough that I would find it easier to discuss this at a design doc level. Do you have a design doc from planning this implementation? I'm not really comfortable with adding any additional multi-node zmq usage without additional non-trivial effort to secure these communications. |
@russellb I appreciate you raising this concern.
lark doc: https://qus2es1bg99i.larksuite.com/wiki/Pbi1wFUTaiBZneksfytuQxrSsTe?from=from_copylink google doc: https://docs.google.com/document/d/1ZwFij2OEx_K1xBx2EBx5FKfXQ9EJEGU6shYh-9MJdPs/edit?usp=sharing |
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
@russellb Hi Russell, for now, I've used 'ipc://' to address immediate security concerns. However, I'll be addressing network security comprehensively in a future PR. I plan to leverage pyzmq.auth to implement robust authentication and authorization mechanisms. |
Signed-off-by: clark <[email protected]>
I don't think that's sufficient. We also need a viable option for encryption, ideally with TLS. |
@russellb I believe the disaggregation feature might benefit from optional TLS encryption. While encryption enhances security, it may introduce a slight performance overhead. Do you mean we can provide a configuration option to enable TLS encryption? This will allow users to choose the security level they need. I think users prefer to deploy clusters within secure environments such as intranets, so they want to improve performance as much as possible. I will conduct in-depth research on auth and encryption before deciding on the selection. Before that zmq was only allowed to run locally. How about this? |
That's fine. I'm completely OK with using it local-only. |
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
Signed-off-by: clark <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work! Can you change the disaggregated prefill example file under the examples
folder? Let's provide some handle for newcomers to run disaggregated prefill example without figuring out how to correctly set all the CLI args.
Signed-off-by: clark <[email protected]>
This pull request has merge conflicts that must be resolved before it can be |
# Conflicts: # examples/online_serving/disaggregated_prefill.sh Signed-off-by: clark <[email protected]>
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
@robertgshaw2-redhat would be great if you can take a look, if it also looks good to you I'll enable automerge. |
Added vLLM Connect to initiate a proxy service and connect to the VLLM Server via ZMQ, improved the performance of prefill-decode disaggregation by 10-30% (TTFT), and 3X- 15X (ITL) on average.
This key change of this PR includes replacing HTTP with ZMQ for communication between the proxy and the VLLM server, and using socket pools to maintain persistent ZMQ connections, which reduces reconnection overhead.
We have attached the benchmark result and the detailed configuration to reproduce the result.
Benchmark
Parameters
Evaluation Steps
Design of ZMQ-based Client-Server Communication
High-level Overview
Design of ZMQ-based Communication